Abstract. In this paper, we address the problem of how many randomly labeled 
patterns can be correctly classified by a single-layer perceptron when the patterns are 
correlated with each other. In order to solve this problem, two analytical schemes 
are developed based on the replica method and Thouless-Anderson-Palmer (TAP) 
approach by utilizing an integral formula concerning random rectangular matrices. 
The validity and relevance of the developed methodologies are shown for one known 
result and two example problems. A message-passing algorithm to perform the TAP 
scheme is also presented. 
1. Introduction 

Learning from examples is one of the most significant problems in information science, 
and (single-layer) perceptrons are often included in widely used devices for solving 
this problem. In the last two decades, the structural similarity between the learning 
problem and the statistical mechanics of disordered systems has been observed, thus 
promoting cross-disciplinary research on perceptron learning with the use of methods 
from statistical mechanics [1, 2]. This research activity has successfully contributed to 
the finding of various behaviors in the learning process of perceptrons [3, 4, 5] and to the 
development of computationally feasible approximate learning algorithms [6, 7] that had 
never been discovered by conventional approaches in information science, particularly 
for the non-asymptotic regimes in which the ratio between the numbers of examples $p$ 
and weight parameters $N$, $\alpha = p/N$, is $O(1)$. 

Although such statistical mechanical methodologies have been successfully applied 
to learning problems, there still remain several research directions to explore. Learning 
from correlated patterns is a typical example of such a problem. In most of the earlier 
studies, it was assumed, for simplicity, that the input patterns used for learning were 
independently and identically distributed (IID) [3, 4, 5]. However, this assumption is 
obviously not practical since real-world data is usually somewhat biased and correlated 
across components, which makes it difficult to utilize the developed schemes directly 
for learning beyond a conceptual level. In order to increase the practical relevance of 
the statistical mechanical approach, it is necessary to generalize the approach to handle 
correlated patterns. 

As a first step for such a research direction, we address the problem of correctly 
classifying many randomly labeled patterns by a single-layer perceptron when the 
patterns are correlated with each other. In data analysis, problems of this kind are 
of practical importance as an assessment of null hypotheses that state no regularity 
represented by the perceptron underlies a given data set. In addition, recent deepening of 
the relations across learning, information and communication shows that the perceptron 
can be utilized as a useful building block for various coding schemes [8, 9, 10, 11]. 
Therefore, exploration to handle learning from correlated patterns may lead to the 
development of better schemes used for information and communication engineering. 

This paper is organized as follows. In the next section, we introduce the problem we 
are studying. In section 3, which is the main part of this article, we develop two schemes 
for analyzing the problem on the basis of the replica method and Thouless-Anderson- 
Palmer (TAP) approach. Statistical mechanical techniques that can handle correlated 
patterns have already been developed by Opper and Winther [12, 13, 14]. However, 
their schemes, which apply to densely connected networks of two-body interactions, are 
highly general, and therefore properties that hold specifically for perceptrons are not 
fully utilized. Hence, in this paper, we offer specific methodologies that can be utilized 
for perceptron type networks. We show that an integral formula provided for ensembles 
of rectangular random matrices plays important roles for the provided methods. A 
message-passing algorithm to solve the developed TAP scheme is also presented. In 
section 4, the validity and utility of the methods are shown by applications to one 
known result and two example problems. The final section is a summary. 

2. Problem definition 

In a general scenario, for an $N$-dimensional input pattern vector $\boldsymbol{x}$, a perceptron which is 
parameterized by an $N$-dimensional weight vector $\boldsymbol{w}$ can be identified with an indicator 
function of class label $y = \pm 1$, 

$$\mathcal{I}(y|\Delta), \tag{1}$$

where $\mathcal{I}(y|\Delta) = 1 - \mathcal{I}(-y|\Delta)$ takes 1 or 0 depending on the value of internal potential 
$\Delta = N^{-1/2}\boldsymbol{w}\cdot\boldsymbol{x}$. Prefactor $N^{-1/2}$ is introduced to keep relevant variables $O(1)$ as 
$N \to \infty$. Equation (1) indicates that a perceptron specified by $\boldsymbol{w}$ correctly classifies a 
given labeled pattern $(\boldsymbol{x}, y)$ if $\mathcal{I}(y|\Delta) = 1$; otherwise, it does not make the correct 
classification. Let us suppose that a set of patterns $\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_p$ is given. The 
problem we consider here is whether the perceptron can typically classify the patterns 
correctly by only adjusting $\boldsymbol{w}$ when the class label of each pattern $\boldsymbol{x}_\mu$, $y_\mu \in \{+1, -1\}$, 
is independently and randomly assigned with a probability of 1/2 for $\mu = 1, 2, \ldots, p$ as 
$N$ and $p$ tend to infinity, keeping the pattern ratio $\alpha = p/N$ of the order of unity. 

In general, entries of pattern matrix $X = N^{-1/2}(\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_p)^{\rm T}$ are correlated 
with each other, where T denotes the matrix transpose. As a basis for dealing with such 
correlations, we introduce an expression of the singular value decomposition 

$$X = U^{\rm T} D V \tag{2}$$

of the pattern matrix $X$, where $D = {\rm diag}(d_k)$ is a $p \times N$ diagonal matrix composed of 
singular values $d_k$ ($k = 1, 2, \ldots, \min(p, N)$), and $U$ and $V$ are $p \times p$ and $N \times N$ orthogonal 
matrices, respectively. $\min(p, N)$ denotes the lesser value of $p$ and $N$. Linear algebra 
guarantees that an arbitrary $p \times N$ matrix can be decomposed according to equation 
(2). The singular values $d_k$ are linked to eigenvalues of the correlation matrix $X^{\rm T}X$, $\lambda_k$ 
($k = 1, 2, \ldots, N$), as $\lambda_k = d_k^2$ ($k = 1, 2, \ldots, \min(p, N)$) and 0 otherwise. The orthogonal 
matrices $U$ and $V$ constitute the eigen bases of correlation matrices $XX^{\rm T}$ and $X^{\rm T}X$, 
respectively. In order to handle correlations in $X$ analytically, we assume that $U$ and 
$V$ are uniformly and independently generated from the Haar measures of $p \times p$ and 
$N \times N$ orthogonal matrices, respectively, and that the empirical eigenvalue spectrum 
of $X^{\rm T}X$, 

$$N^{-1}\sum_{k=1}^{N}\delta(\lambda - \lambda_k) = \left(1 - \min(p, N)/N\right)\delta(\lambda) + N^{-1}\sum_{k=1}^{\min(p,N)}\delta(\lambda - d_k^2),$$

converges to a certain specific distribution $\rho(\lambda)$ in the large system limit of $N, p \to \infty$, 
$\alpha = p/N \sim O(1)$. Controlling $\rho(\lambda)$ allows us to characterize various second-order 
correlations in pattern matrix $X$. 

For generality and analytical tractability, let us assume that $\boldsymbol{w}$ obeys a factorizable 
distribution $P(\boldsymbol{w}) = \prod_{i=1}^{N} P(w_i)$ a priori. Given a labeled pattern set $\xi^p = (X, \boldsymbol{y})$, 
where $\boldsymbol{y} = (y_1, y_2, \ldots, y_p)$, it is possible to assess the volumes of $\boldsymbol{w}$ that are compatible 
with $\xi^p$ as 

$$V(\xi^p) = \mathop{\rm Tr}_{\boldsymbol{w}} \prod_{i=1}^{N} P(w_i) \prod_{\mu=1}^{p} \mathcal{I}(y_\mu|\Delta_\mu), \tag{3}$$

where $\Delta_\mu = N^{-1/2}\boldsymbol{w}\cdot\boldsymbol{x}_\mu$ ($\mu = 1, 2, \ldots, p$) and $\mathop{\rm Tr}_{\boldsymbol{w}}$ denotes the summation (or integral) 
over all possible states of $\boldsymbol{w}$. Equation (3), which is sometimes referred to as the Gardner 
volume, is used for assessing whether $\xi^p$ can be classified by a given type of perceptron 
because it is possible to choose an appropriate $\boldsymbol{w}$ that is fully consistent with $\xi^p$ if and 
only if $V(\xi^p)$ does not vanish [15]. 
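For very small systems the volume in equation (3) can be evaluated by brute force. The following is a minimal sketch, assuming binary weights with a uniform prior on $\{+1,-1\}^N$ and Gaussian patterns; the function name `gardner_volume` and all parameter values are illustrative choices, not part of the original analysis. Because $\mathcal{I}(y|\Delta) = 1 - \mathcal{I}(-y|\Delta)$, every weight vector is compatible with exactly one labeling, so the volumes summed over all $2^p$ labelings add up to the total prior mass.

```python
import itertools
import numpy as np

def gardner_volume(X, y):
    """Prior mass of binary weights w in {+1,-1}^N with y_mu * Delta_mu > 0 for all mu.

    X has shape (p, N) and already contains the N**-0.5 prefactor, so Delta = X @ w.
    A uniform prior P(w_i = +1) = P(w_i = -1) = 1/2 is assumed (illustrative choice).
    """
    p, N = X.shape
    count = 0
    for w in itertools.product((-1.0, 1.0), repeat=N):
        delta = X @ np.array(w)
        if np.all(y * delta > 0):
            count += 1
    return count / 2**N

rng = np.random.default_rng(0)
N, p = 7, 3
X = rng.standard_normal((p, N)) / np.sqrt(N)

# Volume for every one of the 2^p possible labelings of the same patterns
volumes = {y: gardner_volume(X, np.array(y))
           for y in itertools.product((-1, 1), repeat=p)}
total = sum(volumes.values())
print("sum over labelings:", total)
print("label-averaged volume:", total / 2**p)
```

A labeling is classifiable by this perceptron exactly when its entry in `volumes` is nonzero.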

In the large system limit, $V(\xi^p)$ typically vanishes and, therefore, $\xi^p$ cannot be 
correctly classified by perceptrons of the given type when $\alpha$ becomes larger than a certain 
critical value $\alpha_c$, which is often termed perceptron capacity [16, 15]. Since the mid-1980s, 
much effort has been made in the cross-disciplinary field of statistical mechanics 
and information science to assess $\alpha_c$ in various systems [1], in particular, for pattern 
matrices entries of which are independently drawn from an identical distribution of zero 
mean and variance $N^{-1}$. Such situations are characterized by the Marcenko-Pastur law 
$\rho(\lambda) = [1 - \alpha]^{+}\delta(\lambda) + (2\pi)^{-1}\lambda^{-1}\sqrt{[\lambda - \lambda_{-}]^{+}[\lambda_{+} - \lambda]^{+}}$ in the current framework, where 
$[x]^{+} = x$ for $x > 0$ and 0, otherwise, and $\lambda_{\pm} = (\sqrt{\alpha} \pm 1)^2$ [17]. However, it seems that 
little is known about how the correlations in pattern matrices, which are characterized 
by $\rho(\lambda)$ here, influence the perceptron capacity $\alpha_c$. Therefore, the main objective of the 
present article is to answer this question. 
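The link between the decomposition (2), the spectrum of $X^{\rm T}X$ and the Marcenko-Pastur edges $\lambda_\pm$ can be checked numerically. The sketch below assumes Gaussian IID entries and uses NumPy's SVD convention ($X = U_{\rm np}\,{\rm diag}(d)\,V_{\rm np}^{\rm T}$), so `U_np` plays the role of $U^{\rm T}$ and `Vt_np` the role of $V$; the sizes and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 400, 200                      # alpha = p / N = 0.5
alpha = p / N
X = rng.standard_normal((p, N)) / np.sqrt(N)   # IID entries, zero mean, variance 1/N

# Singular value decomposition X = U_np @ D @ Vt_np with a p x N "diagonal" D
U_np, d, Vt_np = np.linalg.svd(X, full_matrices=True)
k = min(p, N)
D = np.zeros((p, N))
D[:k, :k] = np.diag(d)
reconstruction_error = np.max(np.abs(U_np @ D @ Vt_np - X))

# Eigenvalues of X^T X: d_k^2 for k <= min(p, N) and 0 otherwise
eig = np.sort(np.linalg.eigvalsh(X.T @ X))
expected = np.sort(np.concatenate([np.zeros(N - k), d**2]))
spectrum_error = np.max(np.abs(eig - expected))

# Edges of the continuous part of the Marcenko-Pastur law
lam_minus = (np.sqrt(alpha) - 1.0) ** 2
lam_plus = (np.sqrt(alpha) + 1.0) ** 2
nonzero = d**2
print("nonzero eigenvalues in [%.3f, %.3f]" % (nonzero.min(), nonzero.max()))
print("predicted edges      [%.3f, %.3f]" % (lam_minus, lam_plus))
print("mean eigenvalue %.3f (expected close to alpha = %.2f)" % (eig.mean(), alpha))
```

At finite $N$ the extreme eigenvalues fluctuate slightly around the predicted edges, and the fraction $1-\alpha$ of exactly vanishing eigenvalues corresponds to the $[1-\alpha]^{+}\delta(\lambda)$ term.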



3. Analysis 

3.1. A generalization of the Itzykson-Zuber integral 
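The expression 

$$V(\xi^p) = \mathop{\rm Tr}_{\boldsymbol{w}} \int \prod_{\mu=1}^{p}\left(\frac{du_\mu d\Delta_\mu}{2\pi}\exp\left[-\mathrm{i}u_\mu\Delta_\mu\right]\mathcal{I}(y_\mu|\Delta_\mu)\right)\prod_{i=1}^{N}P(w_i)\exp\left[\mathrm{i}\boldsymbol{u}^{\rm T}X\boldsymbol{w}\right]$$

$$\phantom{V(\xi^p)} = \mathop{\rm Tr}_{\boldsymbol{u},\boldsymbol{w}} \prod_{\mu=1}^{p}\widetilde{\mathcal{I}}_{y_\mu}(u_\mu)\prod_{i=1}^{N}P(w_i)\exp\left[\mathrm{i}\boldsymbol{u}^{\rm T}X\boldsymbol{w}\right] \tag{4}$$

constitutes the basis for analyzing the behavior of equation (3), where $\mathrm{i} = \sqrt{-1}$, 
$\boldsymbol{u} = (u_1, u_2, \ldots, u_p)^{\rm T}$ and $\widetilde{\mathcal{I}}_{y_\mu}(u_\mu) = \int d\Delta_\mu \exp\left[-\mathrm{i}u_\mu\Delta_\mu\right]\mathcal{I}(y_\mu|\Delta_\mu)/(2\pi)$. In order to 
evaluate the average of $V(\xi^p)$, we substitute equation (2) into equation (4) and take the 
average with respect to the orthogonal matrices $U$ and $V$. For this assessment, it is 
worthwhile to note that for the fixed sets of dynamical variables $\boldsymbol{w}$ and $\boldsymbol{u}$, $\tilde{\boldsymbol{w}} = V\boldsymbol{w}$ and 
$\tilde{\boldsymbol{u}} = U\boldsymbol{u}$ behave as continuous random variables that are uniformly generated under the 
strict constraints 

$$\frac{1}{N}|\tilde{\boldsymbol{w}}|^2 = \frac{1}{N}|\boldsymbol{w}|^2 = Q_w, \tag{5}$$

$$\frac{1}{p}|\tilde{\boldsymbol{u}}|^2 = \frac{1}{p}|\boldsymbol{u}|^2 = Q_u, \tag{6}$$

when $U$ and $V$ are independently and uniformly generated from the Haar measures. In 
the limit as $N, p \to \infty$, keeping $\alpha = p/N \sim O(1)$, this yields the expression 

$$\frac{1}{N}\ln\overline{\exp\left[\mathrm{i}\boldsymbol{u}^{\rm T}X\boldsymbol{w}\right]} = \frac{1}{N}\ln\left[\frac{\int d\tilde{\boldsymbol{w}}\,d\tilde{\boldsymbol{u}}\;\delta(|\tilde{\boldsymbol{w}}|^2 - NQ_w)\,\delta(|\tilde{\boldsymbol{u}}|^2 - pQ_u)\exp\left[\mathrm{i}\tilde{\boldsymbol{u}}^{\rm T}D\tilde{\boldsymbol{w}}\right]}{\int d\tilde{\boldsymbol{w}}\,d\tilde{\boldsymbol{u}}\;\delta(|\tilde{\boldsymbol{w}}|^2 - NQ_w)\,\delta(|\tilde{\boldsymbol{u}}|^2 - pQ_u)}\right] = F(Q_w, Q_u), \tag{7}$$
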
where $\overline{\cdots}$ denotes averaging with respect to the Haar measures, the function $F(x, y)$ is 
assessed as 

$$F(x, y) = \mathop{\rm Extr}_{\Lambda_x,\Lambda_y}\left\{-\frac{1}{2}\left\langle\ln(\Lambda_x\Lambda_y + \lambda)\right\rangle_\rho - \frac{\alpha - 1}{2}\ln\Lambda_y + \frac{\Lambda_x x}{2} + \frac{\alpha\Lambda_y y}{2}\right\} - \frac{1}{2}\ln x - \frac{\alpha}{2}\ln y - \frac{1 + \alpha}{2}, \tag{8}$$

and $\langle\cdots\rangle_\rho$ indicates averaging with respect to the asymptotic eigenvalue spectrum of 
$X^{\rm T}X$, $\rho(\lambda)$ [18]. The derivation of equations (7) and (8) is shown in Appendix A. 
$\mathop{\rm Extr}_{\theta}\{\cdots\}$ represents extremization with respect to $\theta$. This corresponds to the saddle 
point assessment of a complex integral and does not necessarily mean the operation 
of a minimum or maximum. Expressions analogous to these equations are known as 
the Itzykson-Zuber integral or G-function for ensembles of square (symmetric) matrices 
[19, 20, 21, 22, 23, 24, 25, 26, 27]. Equation (7) implies that the annealed average of 
equation (3) is evaluated as 

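$$\frac{1}{N}\ln\left[V(\xi^p)\right]_{\xi^p} = \mathop{\rm Extr}_{Q_w,Q_u}\left\{F(Q_w, Q_u) + A_w(Q_w) + \alpha A_u(Q_u)\right\}, \tag{9}$$

where $[\cdots]_{\xi^p} = 2^{-p}\mathop{\rm Tr}_{\boldsymbol{y}}(\cdots)$ represents the average with respect to a set of randomly 
labeled patterns $\xi^p$ and 

$$A_w(Q_w) = \mathop{\rm Extr}_{\hat{Q}_w}\left\{\frac{\hat{Q}_w Q_w}{2} + \ln\left[\mathop{\rm Tr}_{w}P(w)\exp\left[-\frac{\hat{Q}_w}{2}w^2\right]\right]\right\}, \tag{10}$$

$$A_u(Q_u) = \mathop{\rm Extr}_{\hat{Q}_u}\left\{\frac{\hat{Q}_u Q_u}{2} + \ln\left[\frac{1}{2}\mathop{\rm Tr}_{y,u}\widetilde{\mathcal{I}}_y(u)\exp\left[-\frac{\hat{Q}_u}{2}u^2\right]\right]\right\}. \tag{11}$$

Normalization constraints $\mathop{\rm Tr}_{y}\mathcal{I}(y|\Delta) = 1$ guarantee that $[V(\xi^p)]_{\xi^p} = 2^{-p}$, which 
implies that for any $\boldsymbol{w}$ the probability that each randomly labeled pattern $(\boldsymbol{x}_\mu, y_\mu)$ 
($\mu = 1, 2, \ldots, p$) is correctly classified is equally 1/2 and, therefore, the size of feasible 
volume $V(\xi^p)$ decreases as $2^{-p}$ on average, regardless of correlations in $X$. In addition, 
in conjunction with equations (9), (10) and (11), this implies that $Q_w = \mathop{\rm Tr}_{w}w^2P(w)$, 
$Q_u = 0$, $\hat{Q}_w = 0$ and $\hat{Q}_u = \alpha^{-1}Q_w\langle\lambda\rangle_\rho$. The physical implication is that, due to the 
central limit theorem, $\boldsymbol{\Delta} = (\Delta_1, \Delta_2, \ldots, \Delta_p)^{\rm T}$ follows an isotropic Gaussian distribution 

$$P(\boldsymbol{\Delta}) = \frac{1}{(2\pi\hat{Q}_u)^{p/2}}\exp\left[-\frac{|\boldsymbol{\Delta}|^2}{2\hat{Q}_u}\right] = \frac{1}{\left(2\pi Q_w\langle\lambda\rangle_\rho/\alpha\right)^{p/2}}\exp\left[-\frac{\alpha|\boldsymbol{\Delta}|^2}{2Q_w\langle\lambda\rangle_\rho}\right] \tag{12}$$

in the limit as $N, p \to \infty$, $\alpha = p/N \sim O(1)$ when $\boldsymbol{w}$ is generated from $P(\boldsymbol{w}) = 
\prod_{i=1}^{N}P(w_i)$, and $U$ and $V$ are independently and uniformly generated from the Haar 
measures. 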
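The function $F(x, y)$ of equation (8) can be evaluated numerically for any finite list of eigenvalues. The sketch below is one possible approach, not the authors' procedure: it solves the two stationarity conditions $x = \langle\Lambda_y/(\Lambda_x\Lambda_y+\lambda)\rangle_\rho$ and $y = (1-\alpha^{-1})\Lambda_y^{-1} + \alpha^{-1}\langle\Lambda_x/(\Lambda_x\Lambda_y+\lambda)\rangle_\rho$ by Newton iteration, starting from $(1/x, 1/y)$, and is only expected to converge when that starting point is close to the relevant saddle. The function name `F_numeric` is hypothetical. As a check it uses the spectrum $(1-\alpha)\delta(\lambda)+\alpha\delta(\lambda-1)$, for which carrying out the extremization by hand gives $F = -1 + C - \frac{1}{2}\ln C$ with $C = (1+\sqrt{1-4\alpha xy})/2$.

```python
import numpy as np

def F_numeric(x, y, lam, alpha, iters=200, tol=1e-13):
    """Evaluate equation (8) for an empirical spectrum `lam` (eigenvalues of X^T X)."""
    lam = np.asarray(lam, dtype=float)
    lx, ly = 1.0 / x, 1.0 / y            # exact saddle point when all eigenvalues vanish
    for _ in range(iters):
        d = lx * ly + lam
        # Stationarity conditions of the bracket in (8) with respect to Lambda_x, Lambda_y
        g = np.array([np.mean(ly / d) - x,
                      (1.0 - 1.0 / alpha) / ly + np.mean(lx / d) / alpha - y])
        # Analytic Jacobian of g with respect to (Lambda_x, Lambda_y)
        J = np.array([[-np.mean(ly**2 / d**2), np.mean(lam / d**2)],
                      [np.mean(lam / d**2) / alpha,
                       -(1.0 - 1.0 / alpha) / ly**2 - np.mean(lx**2 / d**2) / alpha]])
        step = np.linalg.solve(J, g)
        lx, ly = lx - step[0], ly - step[1]
        if np.max(np.abs(step)) < tol:
            break
    d = lx * ly + lam
    return (-0.5 * np.mean(np.log(d)) - 0.5 * (alpha - 1.0) * np.log(ly)
            + 0.5 * lx * x + 0.5 * alpha * ly * y
            - 0.5 * np.log(x) - 0.5 * alpha * np.log(y) - 0.5 * (1.0 + alpha))

# Spectrum (1 - alpha) delta(lambda) + alpha delta(lambda - 1) with alpha = 1/2
alpha, x, y = 0.5, 1.0, 0.1
lam = np.array([0.0] * 5 + [1.0] * 5)
C = 0.5 * (1.0 + np.sqrt(1.0 - 4.0 * alpha * x * y))
F_closed = -1.0 + C - 0.5 * np.log(C)
F_num = F_numeric(x, y, lam, alpha)
print(F_num, F_closed)
```

For small $xy$ both values approach $-\langle\lambda\rangle_\rho xy/2$, which is the origin of the Gaussian form (12).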



3.2. Replica analysis 



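Now we are ready to analyze the typical behavior of equation (3). Because $\xi^p$ is a set of 
quenched random variables, we resort to the replica method [28, 29, 30]. This indicates 
that we evaluate the $n$-th moment of $V(\xi^p)$ for natural numbers $n \in \mathbb{N}$ as 

$$\left[V^n(\xi^p)\right]_{\xi^p} = \mathop{\rm Tr}_{\{\boldsymbol{u}^a\},\{\boldsymbol{w}^a\}}\prod_{\mu=1}^{p}\left(\frac{1}{2}\mathop{\rm Tr}_{y_\mu}\prod_{a=1}^{n}\widetilde{\mathcal{I}}_{y_\mu}(u_\mu^a)\right)\prod_{i=1}^{N}\left(\prod_{a=1}^{n}P(w_i^a)\right)\times\exp\left[\mathrm{i}\sum_{a=1}^{n}(\boldsymbol{u}^a)^{\rm T}X\boldsymbol{w}^a\right] \tag{13}$$

and assess the quenched average of free energy with respect to the labeled pattern set $\xi^p$ 
as $N^{-1}\left[\ln V(\xi^p)\right]_{\xi^p} = \lim_{n\to 0}(\partial/\partial n)N^{-1}\ln\left[V^n(\xi^p)\right]_{\xi^p}$ by analytically continuing expressions 
obtained for equation (13) from $n \in \mathbb{N}$ to real numbers $n \in \mathbb{R}$. Here, $\{\boldsymbol{w}^a\}$ and $\{\boldsymbol{u}^a\}$ 
represent sets of dynamical variables $\boldsymbol{w}^1, \ldots, \boldsymbol{w}^n$ and $\boldsymbol{u}^1, \ldots, \boldsymbol{u}^n$, respectively, where 
$1, 2, \ldots, n$ denote the $n$ replicas of perceptrons. 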
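The identity behind the replica method, $[\ln V] = \lim_{n\to 0}(\partial/\partial n)\ln[V^n]$, can be illustrated on a toy ensemble. The sketch below assumes, purely for illustration, that $\ln V$ is Gaussian distributed; this is not a model of the Gardner volume. Since $\ln[V^n]$ vanishes at $n = 0$, the derivative at the origin is approximated by $n^{-1}\ln[V^n]$ at a small $n$. The helper name `moment_slope` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
log_v = rng.normal(loc=-3.0, scale=1.0, size=200_000)   # toy samples of ln V

quenched = np.mean(log_v)                               # [ln V]

def moment_slope(n):
    """(1/n) ln [V^n], computed with a log-sum-exp shift for numerical stability."""
    a = n * log_v
    return (a.max() + np.log(np.mean(np.exp(a - a.max())))) / n

annealed = moment_slope(1.0)      # ln [V], the annealed average
near_zero = moment_slope(1e-3)    # approaches [ln V] as n -> 0
print("quenched  [ln V]      :", quenched)
print("n = 1e-3  (1/n)ln[V^n]:", near_zero)
print("annealed  ln [V]      :", annealed)
```

The gap between the annealed and quenched values is the reason the $n \to 0$ continuation is needed for quenched quantities.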

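For this procedure, an explanation similar to that for the evaluation of equation 
(7) is useful. Namely, for fixed sets of dynamical variables $\{\boldsymbol{u}^a\}$ and $\{\boldsymbol{w}^a\}$, $\tilde{\boldsymbol{u}}^a = U\boldsymbol{u}^a$ 
and $\tilde{\boldsymbol{w}}^a = V\boldsymbol{w}^a$ behave as continuous random variables which satisfy strict constraints 

$$\frac{1}{N}\tilde{\boldsymbol{w}}^a\cdot\tilde{\boldsymbol{w}}^b = \frac{1}{N}\boldsymbol{w}^a\cdot\boldsymbol{w}^b = q_w^{ab}, \tag{14}$$

$$\frac{1}{p}\tilde{\boldsymbol{u}}^a\cdot\tilde{\boldsymbol{u}}^b = \frac{1}{p}\boldsymbol{u}^a\cdot\boldsymbol{u}^b = q_u^{ab} \tag{15}$$
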
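($a, b = 1, \ldots, n$) when $U$ and $V$ are independently and uniformly generated from the 
Haar measures. This indicates that equation (13) can be evaluated by the saddle point 
method with respect to sets of macroscopic parameters $\mathcal{Q}_w = (q_w^{ab})$ and $\mathcal{Q}_u = (q_u^{ab})$ in 
the limit as $N, p \to \infty$, $\alpha = p/N \sim O(1)$. In addition, intrinsic permutation symmetry 
among replicas indicates that it is natural to assume that $n \times n$ matrices $\mathcal{Q}_w$ and $\mathcal{Q}_u$ 
are of the replica symmetric (RS) form 

$$\mathcal{Q}_w = \begin{pmatrix} \chi_w + q_w & q_w & \cdots & q_w \\ q_w & \chi_w + q_w & \cdots & q_w \\ \vdots & \vdots & \ddots & \vdots \\ q_w & q_w & \cdots & \chi_w + q_w \end{pmatrix} = E \times \begin{pmatrix} \chi_w + nq_w & 0 & \cdots & 0 \\ 0 & \chi_w & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \chi_w \end{pmatrix} \times E^{\rm T} \tag{16}$$

and 

$$\mathcal{Q}_u = \begin{pmatrix} \chi_u - q_u & -q_u & \cdots & -q_u \\ -q_u & \chi_u - q_u & \cdots & -q_u \\ \vdots & \vdots & \ddots & \vdots \\ -q_u & -q_u & \cdots & \chi_u - q_u \end{pmatrix} = E \times \begin{pmatrix} \chi_u - nq_u & 0 & \cdots & 0 \\ 0 & \chi_u & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \chi_u \end{pmatrix} \times E^{\rm T} \tag{17}$$
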
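at the saddle point. Here, $E = (\boldsymbol{e}_1, \boldsymbol{e}_2, \ldots, \boldsymbol{e}_n)$ denotes an $n$-dimensional orthonormal 
basis composed of $\boldsymbol{e}_1 = (n^{-1/2}, n^{-1/2}, \ldots, n^{-1/2})^{\rm T}$ and $n - 1$ orthonormal vectors 
$\boldsymbol{e}_2, \boldsymbol{e}_3, \ldots, \boldsymbol{e}_n$, which are orthogonal to $\boldsymbol{e}_1$. Equations (16) and (17) indicate that under 
the RS ansatz, the $n$ replicas that are coupled with each other in equations (14) and (15) 
can be decoupled by rotating $\{\tilde{\boldsymbol{w}}^a\}$ and $\{\tilde{\boldsymbol{u}}^a\}$ with respect to the replica coordinates 
simultaneously with the use of the identical orthogonal matrix $E$. The already decoupled 
expression $\sum_{a=1}^{n}(\tilde{\boldsymbol{u}}^a)^{\rm T}D\tilde{\boldsymbol{w}}^a$ is kept invariant under this rotation. 
These operations imply that, in the new coordinates, the average with respect to $U$ and 
$V$ over uniform distributions of the Haar measures can be evaluated individually for 
each of the $n$ decoupled modes, which yields 

$$\frac{1}{N}\ln\overline{\exp\left[\mathrm{i}\sum_{a=1}^{n}(\boldsymbol{u}^a)^{\rm T}X\boldsymbol{w}^a\right]} = \frac{1}{N}\ln\left[\frac{\int\prod_{a=1}^{n}d\tilde{\boldsymbol{w}}^a d\tilde{\boldsymbol{u}}^a\;\mathcal{C}_{\rm coupled}\exp\left[\mathrm{i}\sum_{a=1}^{n}(\tilde{\boldsymbol{u}}^a)^{\rm T}D\tilde{\boldsymbol{w}}^a\right]}{\int\prod_{a=1}^{n}d\tilde{\boldsymbol{w}}^a d\tilde{\boldsymbol{u}}^a\;\mathcal{C}_{\rm coupled}}\right]$$

$$= \frac{1}{N}\ln\left[\frac{\int\prod_{a=1}^{n}d\tilde{\boldsymbol{w}}^a d\tilde{\boldsymbol{u}}^a\;\mathcal{C}_{\rm decoupled}\exp\left[\mathrm{i}\sum_{a=1}^{n}(\tilde{\boldsymbol{u}}^a)^{\rm T}D\tilde{\boldsymbol{w}}^a\right]}{\int\prod_{a=1}^{n}d\tilde{\boldsymbol{w}}^a d\tilde{\boldsymbol{u}}^a\;\mathcal{C}_{\rm decoupled}}\right] = F(\chi_w + nq_w, \chi_u - nq_u) + (n - 1)F(\chi_w, \chi_u), \tag{18}$$

where 

$$\mathcal{C}_{\rm coupled} = \prod_{a=1}^{n}\delta\left(|\tilde{\boldsymbol{w}}^a|^2 - N(\chi_w + q_w)\right)\prod_{a>b}\delta\left(\tilde{\boldsymbol{w}}^a\cdot\tilde{\boldsymbol{w}}^b - Nq_w\right)\times\prod_{a=1}^{n}\delta\left(|\tilde{\boldsymbol{u}}^a|^2 - p(\chi_u - q_u)\right)\prod_{a>b}\delta\left(\tilde{\boldsymbol{u}}^a\cdot\tilde{\boldsymbol{u}}^b + pq_u\right) \tag{19}$$

and 

$$\mathcal{C}_{\rm decoupled} = \delta\left(|\tilde{\boldsymbol{w}}^1|^2 - N(\chi_w + nq_w)\right)\prod_{a=2}^{n}\delta\left(|\tilde{\boldsymbol{w}}^a|^2 - N\chi_w\right)\times\delta\left(|\tilde{\boldsymbol{u}}^1|^2 - p(\chi_u - nq_u)\right)\prod_{a=2}^{n}\delta\left(|\tilde{\boldsymbol{u}}^a|^2 - p\chi_u\right). \tag{20}$$
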
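The diagonalization stated in equation (16) is easy to confirm numerically. The following sketch builds an RS matrix for illustrative values of $n$, $\chi_w$ and $q_w$ (arbitrary choices) and rotates it with an orthonormal basis whose first vector is $n^{-1/2}(1, \ldots, 1)^{\rm T}$; obtaining the remaining basis vectors from a QR factorization is an implementation choice, since any completion of the basis works.

```python
import numpy as np

n, chi_w, q_w = 5, 0.7, 0.3

# RS matrix: chi_w + q_w on the diagonal, q_w elsewhere
Q = chi_w * np.eye(n) + q_w * np.ones((n, n))

# Orthonormal basis E = (e_1, ..., e_n) with e_1 proportional to (1, ..., 1)^T.
# Columns 2..n of A are standard basis vectors; QR orthonormalizes them against e_1.
A = np.eye(n)
A[:, 0] = 1.0
E, _ = np.linalg.qr(A)
E[:, 0] *= np.sign(E[0, 0])          # fix the sign so that e_1 has positive entries

rotated = E.T @ Q @ E
expected = np.diag([chi_w + n * q_w] + [chi_w] * (n - 1))
print(np.round(rotated, 12))
```

The single eigenvalue $\chi_w + nq_w$ belongs to the replica-symmetric direction $\boldsymbol{e}_1$, and the $(n-1)$-fold eigenvalue $\chi_w$ to the directions orthogonal to it, which is what produces the two terms on the right-hand side of equation (18).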
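Equation (18) and evaluation of the volumes of dynamical variables $\{\boldsymbol{w}^a\}$ and $\{\boldsymbol{u}^a\}$ 
under constraints (14) and (15) of the RS ansatz (16) and (17) provide an expression 
for the average free energy 

$$\frac{1}{N}\left[\ln V(\xi^p)\right]_{\xi^p} = \lim_{n\to 0}\frac{\partial}{\partial n}\frac{1}{N}\ln\left[V^n(\xi^p)\right]_{\xi^p} = \mathop{\rm Extr}_{\Theta}\left\{A_0(\chi_w, \chi_u, q_w, q_u) + A_w(\chi_w, q_w) + \alpha A_u(\chi_u, q_u)\right\}, \tag{21}$$

where $\Theta = (\chi_w, \chi_u, q_w, q_u)$, 

$$A_0(\chi_w, \chi_u, q_w, q_u) = F(\chi_w, \chi_u) + q_w\frac{\partial F(\chi_w, \chi_u)}{\partial\chi_w} - q_u\frac{\partial F(\chi_w, \chi_u)}{\partial\chi_u}, \tag{22}$$

$$A_w(\chi_w, q_w) = \mathop{\rm Extr}_{\hat{\chi}_w,\hat{q}_w}\left\{\frac{\hat{\chi}_w(\chi_w + q_w)}{2} - \frac{\hat{q}_w\chi_w}{2} + \int Dz\ln\left[\mathop{\rm Tr}_{w}P(w)\exp\left[-\frac{\hat{\chi}_w}{2}w^2 + \sqrt{\hat{q}_w}\,zw\right]\right]\right\} \tag{23}$$

and 

$$A_u(\chi_u, q_u) = \mathop{\rm Extr}_{\hat{\chi}_u,\hat{q}_u}\left\{\frac{\hat{\chi}_u(\chi_u - q_u)}{2} + \frac{\hat{q}_u\chi_u}{2} + \frac{1}{2}\mathop{\rm Tr}_{y}\int Dz\ln\left[\int Dx\,\mathcal{I}\left(y\,\middle|\,\sqrt{\hat{\chi}_u}\,x + \sqrt{\hat{q}_u}\,z\right)\right]\right\}. \tag{24}$$

Here, $Dz = dz\exp\left[-z^2/2\right]/\sqrt{2\pi}$ represents the Gaussian measure. 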

Two points should be noted here. The first is that the current formalism can be 
applied not only to the RS analysis presented above but also to that of replica symmetry 
breaking (RSB) [29, 30]. An expression of the average free energy under the one-step 
RSB (1RSB) ansatz is shown in Appendix B. In addition, analysis of the local instability 
condition of the RS solution (16) and (17) subject to infinitesimal perturbation of the 
form of 1RSB yields 



2 



2d 2 F 



(2) 



a dx. 



2 



d 2 F 
oc \dxwdxi 



Xl 2) xi 2) < 0, (25) 



where 



and 



Y (2) 
A.w 



Dz 



d 2 



d (y/qv,Z 



In 



Tr P(w) exp 



Xw 2 i 

— w + \/q w zw 



,(26) 



Y (2) 



-Tr I Dz 

2 3/ 



5 2 



<9 



In 



y L>sJ (y\\^2uX + \fq\z 



• (27) 



Equation (25) corresponds to the de Almeida-Thouless (AT) condition for the current 
system [31]. The second point is that although randomly labeled patterns are assumed 
here, one can develop a similar framework for analyzing the teacher-student scenario, 
which assigns pattern labels by a teacher perceptron, and which has a deep link to a 
certain class of modern wireless communication systems [H [32j [33J, [3U [351 USE EH [38j 
EH [25]. One can find details of the framework in reference |18j . 
3.3. Thouless-Anderson-Palmer approach and message-passing algorithm 

The scheme developed so far is used for investigating typical macroscopic properties 
of perceptrons which are averaged over pattern set $\xi^p$. However, another method is 
necessary to evaluate microscopic properties of a perceptron for an individual sample 
of $\xi^p$. The Thouless-Anderson-Palmer (TAP) approach [39], originating in spin glass 
research, offers a useful guideline for this purpose. Although several formalisms are 
known for this approximation scheme [6], we follow the one based on the Gibbs free 
energy because of its generality and wide applicability [HJ [22] . 

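Let us suppose a situation for which the microscopic averages of the dynamical 
variables, 

$$\boldsymbol{m}_w = \mathop{\rm Tr}_{\boldsymbol{w}}\boldsymbol{w}P(\boldsymbol{w}|\xi^p) = \frac{1}{V(\xi^p)}\mathop{\rm Tr}_{\boldsymbol{u},\boldsymbol{w}}\boldsymbol{w}\prod_{\mu=1}^{p}\widetilde{\mathcal{I}}_{y_\mu}(u_\mu)\prod_{i=1}^{N}P(w_i)\exp\left[\mathrm{i}\boldsymbol{u}^{\rm T}X\boldsymbol{w}\right] \tag{28}$$

and 

$$\boldsymbol{m}_u = \frac{1}{V(\xi^p)}\mathop{\rm Tr}_{\boldsymbol{u},\boldsymbol{w}}(\mathrm{i}\boldsymbol{u})\prod_{\mu=1}^{p}\widetilde{\mathcal{I}}_{y_\mu}(u_\mu)\prod_{i=1}^{N}P(w_i)\exp\left[\mathrm{i}\boldsymbol{u}^{\rm T}X\boldsymbol{w}\right], \tag{29}$$

are required, where $P(\boldsymbol{w}|\xi^p) = \prod_{\mu=1}^{p}\mathcal{I}(y_\mu|\Delta_\mu)\prod_{i=1}^{N}P(w_i)/V(\xi^p)$ denotes the posterior 
distribution of $\boldsymbol{w}$ given $\xi^p$. The Gibbs free energy 

$$\Phi(\boldsymbol{m}_w, \boldsymbol{m}_u) = \mathop{\rm Extr}_{\boldsymbol{h}_w,\boldsymbol{h}_u}\left\{\boldsymbol{h}_w\cdot\boldsymbol{m}_w + \boldsymbol{h}_u\cdot\boldsymbol{m}_u - \ln\left[V(\boldsymbol{h}_w, \boldsymbol{h}_u)\right]\right\}, \tag{30}$$

where 

$$V(\boldsymbol{h}_w, \boldsymbol{h}_u) = \mathop{\rm Tr}_{\boldsymbol{u},\boldsymbol{w}}\prod_{\mu=1}^{p}\widetilde{\mathcal{I}}_{y_\mu}(u_\mu)\prod_{i=1}^{N}P(w_i)\exp\left[\boldsymbol{h}_w\cdot\boldsymbol{w} + \boldsymbol{h}_u\cdot(\mathrm{i}\boldsymbol{u}) + \mathrm{i}\boldsymbol{u}^{\rm T}X\boldsymbol{w}\right], \tag{31}$$

offers a useful basis because the extremization conditions of equation (30) generally agree 
with equations (28) and (29). This indicates that one can evaluate the microscopic 
averages in equations (28) and (29) by extremization, which leads to assessment of 
the correct free energy, since $\ln V(\xi^p) = -\mathop{\rm Extr}_{\boldsymbol{m}_w,\boldsymbol{m}_u}\left\{\Phi(\boldsymbol{m}_w, \boldsymbol{m}_u)\right\}$ holds, once the 
function of Gibbs free energy (30) is provided. 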

Unfortunately, an exact evaluation of equation (30) is computationally difficult and 
therefore we resort to approximation. For this purpose, we put parameter $l$ in front of 
$X$ in equation (31), which yields the generalized Gibbs free energy as 

$$\tilde{\Phi}(\boldsymbol{m}_w, \boldsymbol{m}_u; l) = \mathop{\rm Extr}_{\boldsymbol{h}_w,\boldsymbol{h}_u}\left\{\boldsymbol{h}_w\cdot\boldsymbol{m}_w + \boldsymbol{h}_u\cdot\boldsymbol{m}_u - \ln\left[V(\boldsymbol{h}_w, \boldsymbol{h}_u; l)\right]\right\}, \tag{32}$$

where $V(\boldsymbol{h}_w, \boldsymbol{h}_u; l)$ is defined by replacing $X$ with $lX$ in equation (31). This implies 
that the correct Gibbs free energy in equation (30) can be obtained as $\Phi(\boldsymbol{m}_w, \boldsymbol{m}_u) = 
\tilde{\Phi}(\boldsymbol{m}_w, \boldsymbol{m}_u; l = 1)$ by setting $l = 1$ in the generalized expression (32). One scheme 
for utilizing this relation is to perform the Taylor expansion around $l = 0$, for which 
$\tilde{\Phi}(\boldsymbol{m}_w, \boldsymbol{m}_u; l)$ can be analytically calculated as an exceptional case, and substitute 
$l = 1$ in the expression obtained, which is sometimes referred to as the Plefka 
expansion [40]. However, evaluation of higher-order terms, which are non-negligible 
for correlated patterns in general, requires a complicated calculation in this expansion, 
which sometimes prevents the scheme from being practical. In order to avoid this 
difficulty, we take an alternative approach here, which is inspired by a derivative of 
equation (32), 

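$$\frac{\partial\tilde{\Phi}(\boldsymbol{m}_w, \boldsymbol{m}_u; l)}{\partial l} = -\left\langle(\mathrm{i}\boldsymbol{u})^{\rm T}X\boldsymbol{w}\right\rangle_l, \tag{33}$$

where $\langle\cdots\rangle_l$ represents the average with respect to the generalized weight $\prod_{\mu=1}^{p}\widetilde{\mathcal{I}}_{y_\mu}(u_\mu)\times 
\prod_{i=1}^{N}P(w_i)\times\exp\left[\boldsymbol{h}_w\cdot\boldsymbol{w} + \boldsymbol{h}_u\cdot(\mathrm{i}\boldsymbol{u}) + \mathrm{i}\boldsymbol{u}^{\rm T}(lX)\boldsymbol{w}\right]$, and $\boldsymbol{h}_w$ and $\boldsymbol{h}_u$ are determined to 
satisfy $\langle\boldsymbol{w}\rangle_l = \boldsymbol{m}_w$ and $\langle(\mathrm{i}\boldsymbol{u})\rangle_l = \boldsymbol{m}_u$, respectively [H]. The right-hand side of this 
equation is the average of a quadratic form containing many random variables. The 
central limit theorem implies that such an average does not depend on details of the 
objective distribution but is determined only by the values of the first and second 
moments. In order to construct a simple approximation scheme, let us assume that 
the second moments are characterized macroscopically by $\langle|\boldsymbol{w}|^2\rangle_l - |\langle\boldsymbol{w}\rangle_l|^2 = N\chi_w$ and 
$\langle|\boldsymbol{u}|^2\rangle_l - |\langle\boldsymbol{u}\rangle_l|^2 = p\chi_u$. Evaluating the right-hand side of equation (33) using a Gaussian 
distribution for which the first and second moments are constrained as $\langle\boldsymbol{w}\rangle_l = \boldsymbol{m}_w$, 
$\langle(\mathrm{i}\boldsymbol{u})\rangle_l = \boldsymbol{m}_u$, $\langle|\boldsymbol{w}|^2\rangle_l - |\langle\boldsymbol{w}\rangle_l|^2 = N\chi_w$ and $\langle|\boldsymbol{u}|^2\rangle_l - |\langle\boldsymbol{u}\rangle_l|^2 = p\chi_u$, and integrating 
from $l = 0$ to $l = 1$ yields 

$$\tilde{\Phi}(\chi_w, \chi_u, \boldsymbol{m}_w, \boldsymbol{m}_u; 1) - \tilde{\Phi}(\chi_w, \chi_u, \boldsymbol{m}_w, \boldsymbol{m}_u; 0) \simeq -\boldsymbol{m}_u^{\rm T}X\boldsymbol{m}_w - NF(\chi_w, \chi_u), \tag{34}$$

where the function $F(x, y)$ is provided as in equation (8) by the empirical eigenvalue 
spectrum of $X^{\rm T}X$, $\rho(\lambda) = N^{-1}\sum_{k=1}^{N}\delta(\lambda - \lambda_k)$ and the macroscopic second moments $\chi_w$ 
and $\chi_u$ are included in arguments of the Gibbs free energy because the right-hand side 
of equation (33) depends on them. Utilizing this and evaluating $\tilde{\Phi}(\chi_w, \chi_u, \boldsymbol{m}_w, \boldsymbol{m}_u; 0)$, 
which is not computationally difficult since interaction terms are not included, yield an 
approximation of the Gibbs free energy as 

$$\tilde{\Phi}(\chi_w, \chi_u, \boldsymbol{m}_w, \boldsymbol{m}_u) \simeq -\boldsymbol{m}_u^{\rm T}X\boldsymbol{m}_w - NF(\chi_w, \chi_u)$$

$$\quad + \mathop{\rm Extr}_{\hat{\chi}_w,\boldsymbol{h}_w}\left\{\boldsymbol{h}_w\cdot\boldsymbol{m}_w - \frac{1}{2}\hat{\chi}_w\left(N\chi_w + |\boldsymbol{m}_w|^2\right) - \sum_{i=1}^{N}\ln\left[\mathop{\rm Tr}_{w}P(w)e^{-\frac{1}{2}\hat{\chi}_w w^2 + h_{wi}w}\right]\right\}$$

$$\quad + \mathop{\rm Extr}_{\hat{\chi}_u,\boldsymbol{h}_u}\left\{\boldsymbol{h}_u\cdot\boldsymbol{m}_u - \frac{1}{2}\hat{\chi}_u\left(p\chi_u - |\boldsymbol{m}_u|^2\right) - \sum_{\mu=1}^{p}\ln\left[\int Dx\,\mathcal{I}\left(y_\mu\,\middle|\,\sqrt{\hat{\chi}_u}\,x + h_{u\mu}\right)\right]\right\}, \tag{35}$$

which is a general expression of the TAP free energy of the current system. 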
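Extremization of this equation provides a set of TAP equations 

$$m_{wi} = \frac{\partial}{\partial h_{wi}}\ln\left[\mathop{\rm Tr}_{w}P(w)e^{-\frac{1}{2}\hat{\chi}_w w^2 + h_{wi}w}\right], \tag{36}$$

$$\chi_w = \frac{1}{N}\sum_{i=1}^{N}\frac{\partial^2}{\partial h_{wi}^2}\ln\left[\mathop{\rm Tr}_{w}P(w)e^{-\frac{1}{2}\hat{\chi}_w w^2 + h_{wi}w}\right], \tag{37}$$

$$m_{u\mu} = \frac{\partial}{\partial h_{u\mu}}\ln\left[\int Dx\,\mathcal{I}\left(y_\mu\,\middle|\,\sqrt{\hat{\chi}_u}\,x + h_{u\mu}\right)\right], \tag{38}$$

$$\chi_u = -\frac{1}{p}\sum_{\mu=1}^{p}\frac{\partial^2}{\partial h_{u\mu}^2}\ln\left[\int Dx\,\mathcal{I}\left(y_\mu\,\middle|\,\sqrt{\hat{\chi}_u}\,x + h_{u\mu}\right)\right], \tag{39}$$

where 

$$\boldsymbol{h}_w = X^{\rm T}\boldsymbol{m}_u - 2\frac{\partial}{\partial\chi_w}F(\chi_w, \chi_u)\,\boldsymbol{m}_w, \tag{40}$$

$$\hat{\chi}_w = -2\frac{\partial}{\partial\chi_w}F(\chi_w, \chi_u), \tag{41}$$

$$\boldsymbol{h}_u = X\boldsymbol{m}_w + \frac{2}{\alpha}\frac{\partial}{\partial\chi_u}F(\chi_w, \chi_u)\,\boldsymbol{m}_u, \tag{42}$$

$$\hat{\chi}_u = -\frac{2}{\alpha}\frac{\partial}{\partial\chi_u}F(\chi_w, \chi_u), \tag{43}$$

solutions of which represent approximate values of the first and second moments of the 
posterior distribution $P(\boldsymbol{w}, \boldsymbol{u}|\xi^p)$ for a fixed sample of $\xi^p$. In equations (40) and (42), 
$-2(\partial/\partial\chi_w)F(\chi_w, \chi_u)\boldsymbol{m}_w$ and $(2/\alpha)(\partial/\partial\chi_u)F(\chi_w, \chi_u)\boldsymbol{m}_u$ are generally referred to as the 
Onsager reaction terms. The counterparts of these equations for systems of two-body 
interactions have been presented in an earlier paper [22]. 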

Solving TAP equations (36)-(43) is not a trivial task. Empirically, naive iterative 
substitution of these equations does not converge in most cases. Conversely, it is reported 
that message-passing (MP) algorithms of a certain type, which are developed on the 
basis of the belief propagation [41], exhibit excellent solution search performance for 
pattern sets entries of which are IID with low computational cost [32, 42]. Therefore, 
we developed an MP algorithm as a promising heuristic that reproduces known, efficient 
algorithms for IID pattern matrices. A pseudocode of the proposed algorithm is shown 
in figure 1. One can generalize this algorithm to the case of probabilistic perceptrons by 
replacing the indicator function $\mathcal{I}(y|\Delta)$ with a certain conditional probability $P(y|\Delta)$. 
It should be noted that $\Lambda_w$ and $\Lambda_u$ in the algorithm denote the counterparts of $\Lambda_x$ 
and $\Lambda_y$ in equation (8) for $x = \chi_w$ and $y = \chi_u$, respectively. Solving $(\chi_u, \Lambda_u)$ and 
$(\chi_w, \Lambda_w)$ in H-Step and V-Step, respectively, can be performed efficiently by use 
of the bisection method. Solving the TAP equations employing this algorithm yields 
approximate estimates of the free energy $\ln V(\xi^p)$ and its derivatives as well as $\boldsymbol{m}_w$ and 
$\boldsymbol{m}_u$, which can be utilized for assessing whether the given specific sample $\xi^p$ can be 
correctly classified by the perceptron. 
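The bisection search mentioned above can be sketched as follows for the H-Step. The sketch assumes that the two conditions to be satisfied are the stationarity conditions of equation (8), namely $\chi_w = \langle\Lambda_u/(\Lambda_w\Lambda_u+\lambda)\rangle_\rho$ and $\chi_u = (1-\alpha^{-1})\Lambda_u^{-1} + \alpha^{-1}\langle\Lambda_w/(\Lambda_w\Lambda_u+\lambda)\rangle_\rho$. The first condition is nondecreasing in $\Lambda_u$, so it is solved by bisection, and the second then gives $\chi_u$ directly. The function name `h_step_search`, the bracketing interval and the use of a logarithmic scale are illustrative choices.

```python
import numpy as np

def h_step_search(chi_w, lam_w, lam, alpha, iters=200):
    """Find (chi_u, Lambda_u) for given (chi_w, Lambda_w) and eigenvalues `lam` of X^T X."""
    lam = np.asarray(lam, dtype=float)

    def g(lam_u):
        # <Lambda_u / (Lambda_w Lambda_u + lambda)>_rho - chi_w, nondecreasing in Lambda_u
        return np.mean(lam_u / (lam_w * lam_u + lam)) - chi_w

    lo, hi = 1e-12, 1e12
    if not (g(lo) < 0.0 < g(hi)):
        raise ValueError("no sign change: chi_w is outside the attainable range")
    for _ in range(iters):
        mid = np.sqrt(lo * hi)           # bisection on a logarithmic scale
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    lam_u = np.sqrt(lo * hi)
    chi_u = (1.0 - 1.0 / alpha) / lam_u + np.mean(lam_w / (lam_w * lam_u + lam)) / alpha
    return chi_u, lam_u

# Example: spectrum (1 - alpha) delta(lambda) + alpha delta(lambda - 1) with alpha = 1/2
alpha = 0.5
lam = np.array([0.0] * 5 + [1.0] * 5)
lam_w = 0.5 * (1.0 + np.sqrt(0.8))       # value of Lambda_w at the saddle for x = 1, y = 0.1
chi_u, lam_u = h_step_search(1.0, lam_w, lam, alpha)
print(chi_u, lam_u)
```

The V-Step search is the mirror image: the second condition is solved for $\Lambda_w$ at fixed $(\chi_u, \Lambda_u)$, and the first one then returns $\chi_w$.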

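Although we have assumed single macroscopic constraints as characterizing the 
second moments, the current formalism can be generalized to include component-wise 
multiple constraints for constructing more accurate approximations. By doing 
this, the current formalism leads to the adaptive TAP approach or, more generally, 
to the expectation consistent approximate schemes developed by Opper and Winther. 

MPforPerceptron{ 
    Perform Initialization; 
    Iterate H-Step and V-Step alternately sufficient times; 
} 

Initialization{ 
    Set initial values of $(\chi_w, \Lambda_w)$; 
    $m_{wi} \leftarrow \mathop{\rm Tr}_{w_i} w_i P(w_i)$ $(i = 1, 2, \ldots, N)$; 
    $\boldsymbol{h}_u \leftarrow X\boldsymbol{m}_w$; $\boldsymbol{m}_u \leftarrow 0$; 
} 

H-Step{ 
    Search $(\chi_u, \Lambda_u)$ for given $(\chi_w, \Lambda_w)$ to satisfy conditions 
        $\chi_w = \left\langle \Lambda_u/(\Lambda_w\Lambda_u + \lambda) \right\rangle_\rho$ and $\chi_u = (1 - \alpha^{-1})\Lambda_u^{-1} + \alpha^{-1}\left\langle \Lambda_w/(\Lambda_w\Lambda_u + \lambda) \right\rangle_\rho$; 
    $\hat{\chi}_u \leftarrow 1/\chi_u - \Lambda_u$; 
    $\boldsymbol{h}_u \leftarrow \boldsymbol{h}_u - \hat{\chi}_u \boldsymbol{m}_u$; 
    $m_{u\mu} \leftarrow \frac{\partial}{\partial h_{u\mu}} \ln\left[\int Dx\, \mathcal{I}(y_\mu | \sqrt{\hat{\chi}_u}\, x + h_{u\mu})\right]$ $(\mu = 1, 2, \ldots, p)$; 
    $\boldsymbol{h}_w \leftarrow X^{\rm T}\boldsymbol{m}_u$; 
    $\chi_u \leftarrow -\frac{1}{p}\sum_{\mu=1}^{p} \frac{\partial^2}{\partial h_{u\mu}^2} \ln\left[\int Dx\, \mathcal{I}(y_\mu | \sqrt{\hat{\chi}_u}\, x + h_{u\mu})\right]$; 
    $\Lambda_u \leftarrow 1/\chi_u - \hat{\chi}_u$; 
} 

V-Step{ 
    Search $(\chi_w, \Lambda_w)$ for given $(\chi_u, \Lambda_u)$ to satisfy conditions 
        $\chi_w = \left\langle \Lambda_u/(\Lambda_w\Lambda_u + \lambda) \right\rangle_\rho$ and $\chi_u = (1 - \alpha^{-1})\Lambda_u^{-1} + \alpha^{-1}\left\langle \Lambda_w/(\Lambda_w\Lambda_u + \lambda) \right\rangle_\rho$; 
    $\hat{\chi}_w \leftarrow 1/\chi_w - \Lambda_w$; 
    $\boldsymbol{h}_w \leftarrow \boldsymbol{h}_w + \hat{\chi}_w \boldsymbol{m}_w$; 
    $m_{wi} \leftarrow \frac{\partial}{\partial h_{wi}} \ln\left[\mathop{\rm Tr}_{w} P(w) e^{-\frac{1}{2}\hat{\chi}_w w^2 + h_{wi} w}\right]$ $(i = 1, 2, \ldots, N)$; 
    $\boldsymbol{h}_u \leftarrow X\boldsymbol{m}_w$; 
    $\chi_w \leftarrow \frac{1}{N}\sum_{i=1}^{N} \frac{\partial^2}{\partial h_{wi}^2} \ln\left[\mathop{\rm Tr}_{w} P(w) e^{-\frac{1}{2}\hat{\chi}_w w^2 + h_{wi} w}\right]$; 
    $\Lambda_w \leftarrow 1/\chi_w - \hat{\chi}_w$; 
} 

Figure 1. Pseudocode of the proposed message-passing algorithm MPforPerceptron. 
";" and "$\leftarrow$" represent the end of a command line and the operation of substitution, 
respectively. 
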


4. Examples 

4.1. Independently and identically distributed patterns 

In order to investigate the relationship with existing results, let us first apply the 
developed methodologies to the case in which the entries of $X$ are IID of zero mean 
and variance $N^{-1}$. This case can be characterized by the eigenvalue spectrum of the 
Marcenko-Pastur type, which was already mentioned in section 2 and yields 

$$F(x, y) = -\frac{\alpha}{2}xy. \tag{44}$$

This implies that equation (22) can be expressed as 

$$A_0(\chi_w, \chi_u, q_w, q_u) = -\frac{\alpha}{2}\left(\chi_w\chi_u + q_w\chi_u - \chi_w q_u\right). \tag{45}$$

Inserting this into equation (21) and then performing an extremization with respect to 
$\chi_u$ and $q_u$ yields 

$$\hat{\chi}_u = \chi_w, \qquad \hat{q}_u = q_w, \tag{46}$$

where $\hat{\chi}_u$ and $\hat{q}_u$ are the variational variables used in equation (24). This implies that 
the replica symmetric free energy (21) can be expressed as 

$$\frac{1}{N}\left[\ln V(\xi^p)\right]_{\xi^p} = \mathop{\rm Extr}_{\chi_w,q_w}\left\{A_w(\chi_w, q_w) + \frac{\alpha}{2}\mathop{\rm Tr}_{y}\int Dz\ln\left[\int Dx\,\mathcal{I}\left(y\,\middle|\,\sqrt{\chi_w}\,x + \sqrt{q_w}\,z\right)\right]\right\}. \tag{47}$$

This is equivalent to the general expression of the replica symmetric free energy of a 
single-layer perceptron for the IID pattern matrices and randomly assigned labels [4, 43]. 
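The step from equation (45) to equation (46) can be written out explicitly. The sketch below assumes that the only part of $\alpha A_u$ depending on $\chi_u$ and $q_u$ is the bilinear term $\alpha\left[\hat{\chi}_u(\chi_u - q_u) + \hat{q}_u\chi_u\right]/2$, with the rest of equation (24) depending on $\hat{\chi}_u$ and $\hat{q}_u$ alone; this sign convention is an assumption, chosen because it reproduces equation (46).

```latex
\begin{align*}
A_0 &= F + q_w\,\partial_{\chi_w}F - q_u\,\partial_{\chi_u}F
     = -\frac{\alpha}{2}\left(\chi_w\chi_u + q_w\chi_u - \chi_w q_u\right)
     \quad\text{for } F(\chi_w,\chi_u) = -\frac{\alpha}{2}\chi_w\chi_u, \\
0 &= \frac{\partial}{\partial\chi_u}\left(A_0 + \alpha A_u\right)
   = -\frac{\alpha}{2}\left(\chi_w + q_w\right) + \frac{\alpha}{2}\left(\hat{\chi}_u + \hat{q}_u\right), \\
0 &= \frac{\partial}{\partial q_u}\left(A_0 + \alpha A_u\right)
   = \frac{\alpha}{2}\chi_w - \frac{\alpha}{2}\hat{\chi}_u
   \;\;\Longrightarrow\;\; \hat{\chi}_u = \chi_w,\quad \hat{q}_u = q_w .
\end{align*}
```

With these values the terms linear in $\chi_u$ and $q_u$ cancel between $A_0$ and $\alpha A_u$, so only $A_w$ and the Gaussian-integral term of $A_u$, evaluated at $\hat{\chi}_u = \chi_w$ and $\hat{q}_u = q_w$, remain in equation (47).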



4.2. Rank deficient patterns vs. spherical weights 

In data analysis, the property of pattern components strongly correlated with each 
other is referred to as multicollinearity, which sometimes requires special treatment. As 
a second example, we utilize the developed framework to examine how this property 
influences $\alpha_c$. 

Figure 2. Assessment of $\alpha_c$ of spherical weights for rank deficient pattern matrices. 
For $N = 4, 8, 12, \ldots, 40$, the critical pattern ratio $\alpha_c(N)$, which is defined as the 
average of the maximum pattern ratio above which no weight can correctly classify 
a given sample of $\xi^p$, was assessed from $10^4$ experiments. Each estimate of $\alpha_c(N)$ 
was obtained by extrapolating $\alpha_c(N, T_{\max})$, which is an average value of $\alpha$ above 
which the perceptron learning algorithm [47] does not converge after the number of 
updates reaches $T_{\max}$ for a given sample of $\xi^p$ with respect to $T_{\max} = 10^3 \sim 2 \times 10^4$. 
The capacity is estimated by a quadratic fitting under the assumption of $\alpha_c(N) \simeq 
\alpha_c + aN^{-1} + bN^{-2}$ where $a$ and $b$ are adjustable parameters. (a) and (b) represent 
results for $\tilde{\rho}(\lambda) = (2\pi\lambda)^{-1}\sqrt{[\lambda - (\sqrt{\alpha/c} - 1)^2]^{+}[(\sqrt{\alpha/c} + 1)^2 - \lambda]^{+}}$ and $\delta(\lambda - 1)$, 
respectively. For both cases, each data corresponds to $c = 1/4, 1/2, 3/4$ and 1 from the 
bottom. The estimates of $\alpha_c$ show excellent consistency with the theoretical prediction 
$\alpha_c = 2c$ regardless of $\tilde{\rho}(\lambda)$. 

Strong correlations among components can be modeled by rank deficiency of the 
cross-correlation matrix $X^{\rm T}X$. In the current framework, this is characterized by an 
eigenvalue spectrum of the form 

$$\rho(\lambda) = (1 - c)\delta(\lambda) + c\tilde{\rho}(\lambda), \tag{48}$$

where $0 < c \le 1$ denotes the ratio between the rank of $X^{\rm T}X$ and $N$, and $\tilde{\rho}(\lambda)$ is a certain 
distribution the support of which is defined over a region of $\lambda > 0$. For simplicity, let 
us limit ourselves to the case of simple perceptron and spherical weights, for which 
$\mathcal{I}(y|\Delta) = 1$ for $y\Delta > 0$ and 0, otherwise, and $P(\boldsymbol{w}) \propto \delta(|\boldsymbol{w}|^2 - N)$. Inserting these into 
equation (21) offers a set of saddle point equations. Among them, those relevant for 
capacity analysis are 

x „=fl-£) 1- + £/ A=^\, (50) 



& = _£a%*0 =: L_ An (51) 

\„ - / /,': - Z= , . h < n 1 1 i: --> 



aJ A„ a \A W A U +A/- 

W ) X.U ) -*- 

a dxu Xu 
d 2 

where $H(x) = \int_x^{\infty} Dz$. 

Figure 3. (a): Entropy of $\boldsymbol{w}$ (per element) versus the pattern ratio $\alpha$. The curve 
represents the theoretical prediction assessed by the replica method and the markers 
denote experimental data obtained by MPforPerceptron for 100 samples of $\xi^p$ of 
$N = 500$ systems. (b): Diagnosis of the AT stability. $\delta_{\rm AT}$, which is the inverse of the 
left-hand side of equation (25), is plotted versus $\alpha$ for the assessed RS solution. $\delta_{\rm AT}$ 
becomes negative for $\alpha > \alpha_{\rm AT} \simeq 0.810$, indicating the occurrence of RSB. 

Let us assume that no RSB occurs for $\alpha < \alpha_c$, as is the case for IID patterns. 
Under this assumption, a critical condition is offered by taking a limit $\hat{\chi}_u \to 0$, which 
implies that the variance of $\boldsymbol{\Delta} = (\Delta_1, \Delta_2, \ldots, \Delta_p)$ of the posterior distribution for a 
given sample $\xi^p$ typically vanishes. Applying an asymptotic form, $\ln H(x) \simeq -x^2/2$ for 
$x \gg 1$, to equation (52) in conjunction with equation (51) yields 

(53) 

Inserting this into equation fl50l) gives 

1 - - ( l ~, — ) > 0, (54) 

a a \A W + 2Xxu / ? 

This means that no RS solution can exist for $\alpha > 2c$, indicating that the perceptron 
capacity is given as 

$$\alpha_c = 2c, \tag{55}$$

regardless of $\tilde{\rho}(\lambda)$. Equation (55) is consistent with the known result $\alpha_c = 2$ for IID 
patterns [16, 15], for which $c = 1$ as $X^{\rm T}X$ is typically of full rank for $\alpha > 1$. Numerical 
experiments for rank deficient pattern matrices support the present analysis, which is 
shown in figure 2. 
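A single trial of such an experiment can be sketched as follows. This is a minimal illustration and not the authors' experimental code: it assumes all nonzero singular values equal 1, draws $U$ and $V$ from the Haar measure through a QR factorization with a sign correction, and uses the classical perceptron update rule with an update budget `t_max`; all function names and sizes are hypothetical.

```python
import numpy as np

def haar_orthogonal(n, rng):
    """Random orthogonal matrix from the Haar measure (QR of a Gaussian matrix, sign-fixed)."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))

def rank_deficient_patterns(p, N, rank, rng):
    """X = U^T D V with `rank` singular values equal to 1 and the rest 0 (capped at min(p, N))."""
    D = np.zeros((p, N))
    k = min(rank, p, N)
    D[np.arange(k), np.arange(k)] = 1.0
    return haar_orthogonal(p, rng).T @ D @ haar_orthogonal(N, rng)

def perceptron_converges(X, y, t_max):
    """Perceptron learning rule; True if every pattern is classified within t_max updates."""
    w = np.zeros(X.shape[1])
    updates = 0
    while updates < t_max:
        wrong = np.flatnonzero(y * (X @ w) <= 0)
        if wrong.size == 0:
            return True
        mu = wrong[0]
        w += y[mu] * X[mu]               # move w toward the misclassified pattern
        updates += 1
    return bool(np.all(y * (X @ w) > 0))

rng = np.random.default_rng(3)
N = 40

# Below capacity: c = 1/2 (rank 20), alpha = 0.25 < 2c
X_easy = rank_deficient_patterns(10, N, 20, rng)
y_easy = rng.choice([-1.0, 1.0], size=10)
easy = perceptron_converges(X_easy, y_easy, t_max=2000)

# Above capacity: c = 1/4 (rank 10), alpha = 1 > 2c
X_hard = rank_deficient_patterns(40, N, 10, rng)
y_hard = rng.choice([-1.0, 1.0], size=40)
hard = perceptron_converges(X_hard, y_hard, t_max=2000)
print(easy, hard)
```

Repeating such trials over many samples and values of $p$, and extrapolating in `t_max` and $N$, is the kind of procedure the caption of figure 2 describes.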

4-3. Random orthogonal patterns vs. binary weights 

Equation f)55p means that the capacity depends only on the rank of the cross-correlation 
matrix X T X in the case of spherical weights; however, this is not always the case. To 
show this, we present a capacity problem of binary weights w = {+1, —1}^ as the final 
example. 

It is known that in typical cases, simple perceptrons of binary weights can correctly 
classify randomly labeled IID patterns for $\alpha < \alpha_c \simeq 0.833$ [HI SSI SSj]. Our question 
here is how $\alpha_c$ is modified when the pattern matrix $X$ is generated randomly in such a 
way that the patterns are orthogonal to each other. 

Figure 4. Results of exhaustive search experiments. For $N = 2, 3, \ldots, 20$, $\alpha_c(N)$, 
which is defined in the caption of figure 2, was estimated from $10^6$ experiments 
performed by an exhaustive search of binary weights. The values of capacity $\alpha_c$ are 
estimated by employing a quadratic fitting similar to that explained in the caption of 
figure 2. For IID patterns, this yields an estimate of $\alpha_c \simeq 0.819$, whereas the theoretical 
prediction is 0.833 and is considered as exact. The estimate $\alpha_c \simeq 0.938$ for the random 
orthogonal patterns is reasonably close to the theoretical prediction 0.940, which is 
obtained from the unstable RS solution. 
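
The exhaustive search experiments reported in figure 4 can be illustrated by the following minimal Python sketch. It is not the code used for the figure: the function names, the use of Gaussian IID entries, and the classification condition $y_\mu (w \cdot x_\mu) > 0$ are assumptions made for illustration, and the sketch covers IID patterns only.

```python
import itertools
import random


def has_binary_solution(patterns, labels):
    """Return True if some w in {+1, -1}^N satisfies y * (w . x) > 0 for every
    (pattern, label) pair. All 2^N weight vectors are tried, so keep N small."""
    n = len(patterns[0])
    for w in itertools.product((1, -1), repeat=n):
        if all(y * sum(wi * xi for wi, xi in zip(w, x)) > 0
               for x, y in zip(patterns, labels)):
            return True
    return False


def success_rate(n, p, trials, rng):
    """Estimate the probability that p randomly labeled IID Gaussian patterns
    of dimension n are classified correctly by some binary weight vector."""
    hits = 0
    for _ in range(trials):
        patterns = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(p)]
        labels = [rng.choice((1, -1)) for _ in range(p)]
        hits += has_binary_solution(patterns, labels)
    return hits / trials


if __name__ == "__main__":
    rng = random.Random(0)
    n = 8
    for p in (2, 4, 6, 8, 10):
        print(f"N = {n}, p = {p}: success rate = {success_rate(n, p, 50, rng):.2f}")
```

A finite-size capacity estimate could then be read off from where the success rate crosses a chosen threshold as `p / n` increases; the precise definition used in the paper is the one given in the caption of figure 2.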

To answer this question, we employ the replica and TAP methods developed in the 
preceding sections for $\rho(\lambda) = (1 - \alpha)\delta(\lambda) + \alpha\delta(\lambda - 1)$, which represents the eigenvalue 
spectrum of the random orthogonal patterns assuming $0 < \alpha < 1$ and yields 

$$F(x,y) = -1 + C - \frac{1}{2}\ln C, \tag{56}$$

where $C = \frac{1}{2}\left(1 \pm \sqrt{1 - 4\alpha xy}\right)$. Here, the sign $\pm$ is chosen so that the operation of 
${\rm Extr}_{\Lambda_x,\Lambda_y}\{\cdots\}$ in equation (8) corresponds to the correct saddle point evaluation of 
equation (A.3). Figure 3 (a) shows how the entropy of $w$ depends on the pattern ratio 
$\alpha$. The curve denotes the theoretical prediction of the replica analysis and the markers 
denote the averages of the entropy obtained by the TAP method over 100 samples for 
$N = 500$ systems. The error bars are smaller than the markers. Solutions of the TAP 
method are obtained by MPforPerceptron, shown in figure 1. Although the curve and 
the markers exhibit excellent agreement for data points $\alpha = 0.1, 0.2, \ldots, 0.8$, we were 
not able to obtain a reliable result for $\alpha = 0.9$, at which point this algorithm does not 
converge in most cases, even after 1000 iterations. This may be a consequence of RSB, 
since the replica analysis indicates that the AT stability of the RS solution shown in 
figure 3 (a) is broken for $\alpha$ beyond $\alpha_{\rm AT} \simeq 0.810$ (see figure 3 (b)). Therefore, $\alpha_c \simeq 0.940$, 
indicated by the condition of vanishing entropy, is regarded not as the exact value but as an 
approximate value provided by the unstable RS solution. However, extrapolation of the 
results of direct numerical experiments for finite-size systems indicates that $\alpha_c \simeq 0.938$, 
as shown in figure 4, which implies that the effect of RSB is not significant for the 
evaluation of $\alpha_c$ in this particular case. 
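
The eigenvalue spectrum $\rho(\lambda) = (1 - \alpha)\delta(\lambda) + \alpha\delta(\lambda - 1)$ assumed for random orthogonal patterns can be checked on a small example. The Python sketch below is illustrative only: it builds $p$ orthonormal patterns by Gram-Schmidt orthogonalization of Gaussian vectors (one possible construction, assumed here) and verifies that $X^{\rm T}X$ is a projector of trace $p$, so that its eigenvalues are 1 with multiplicity $p$ and 0 with multiplicity $N - p$. All function names are hypothetical.

```python
import random


def random_orthonormal_rows(p, n, rng):
    """Return p orthonormal vectors in R^n (p <= n) obtained by Gram-Schmidt
    orthogonalization of IID Gaussian vectors."""
    rows = []
    for _ in range(p):
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for r in rows:
            proj = sum(a * b for a, b in zip(v, r))
            v = [a - proj * b for a, b in zip(v, r)]
        norm = sum(a * a for a in v) ** 0.5
        rows.append([a / norm for a in v])
    return rows


def gram(rows):
    """Return the n x n matrix X^T X for the p x n matrix X with the given rows."""
    n = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]


def projector_defect(mat):
    """Return max |(P P - P)_ij|, which vanishes when P is a projector."""
    n = len(mat)
    return max(abs(sum(mat[i][k] * mat[k][j] for k in range(n)) - mat[i][j])
               for i in range(n) for j in range(n))


if __name__ == "__main__":
    p, n = 6, 20
    P = gram(random_orthonormal_rows(p, n, random.Random(1)))
    print("trace of X^T X:", sum(P[i][i] for i in range(n)))
    print("projector defect:", projector_defect(P))
```

A projector has eigenvalues 0 and 1 only, and its trace counts the unit eigenvalues, so a fraction $\alpha = p/N$ of the eigenvalues equals 1, in line with the assumed $\rho(\lambda)$.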



5. Summary 

We developed a framework for analyzing the classification problems of perceptrons for 
randomly labeled patterns, with the aim of handling correlated patterns. 
For this purpose, we developed two methodologies based on the replica method and the 
Thouless-Anderson-Palmer (TAP) approach, which are standard techniques from the 
statistical mechanics of disordered systems, and introduced a specific randomness 
assumption about the singular value decomposition of the pattern matrix. In both 
schemes, an integral formula, which can be regarded as a generalization of the 
Itzykson-Zuber integral known for square (symmetric) matrices, plays an important role. As a 
promising heuristic for solving the TAP equations, we provided a message-passing algorithm, 
MPforPerceptron. The validity and utility of the developed schemes are shown for 
one known result and two novel problems. 

Investigation of the properties of MPforPerceptron, as well as application of 
the developed framework to real-world data analysis [121 US] and various models of 
information and communication engineering [4*91 IT?] , are promising topics for future 
research. 
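Appendix A. Derivation of Equations (7) and (8) 

The expressions 

$$\delta(|w|^2 - Nx) = \frac{1}{2\pi{\rm i}}\int_{-{\rm i}\infty}^{+{\rm i}\infty}\frac{d\Lambda_x}{2}\exp\left[-\frac{\Lambda_x}{2}\left(|w|^2 - Nx\right)\right], \tag{A.1}$$

$$\delta(|u|^2 - py) = \frac{1}{2\pi{\rm i}}\int_{-{\rm i}\infty}^{+{\rm i}\infty}\frac{d\Lambda_y}{2}\exp\left[-\frac{\Lambda_y}{2}\left(|u|^2 - py\right)\right], \tag{A.2}$$

yield an integral 

$$\begin{aligned}
&\int dw\,du\,\delta(|w|^2 - Nx)\,\delta(|u|^2 - py)\exp\left[{\rm i}u^{\rm T}Dw\right] \\
&\quad= \frac{1}{(4\pi{\rm i})^2}\int d\Lambda_x\,d\Lambda_y\left(\int dw\,du\,\exp\left[-\frac{\Lambda_x}{2}|w|^2 - \frac{\Lambda_y}{2}|u|^2 + {\rm i}u^{\rm T}Dw\right]\right)\exp\left[\frac{N\Lambda_x x}{2} + \frac{p\Lambda_y y}{2}\right] \\
&\quad= \frac{(2\pi)^{(N+p)/2}}{(4\pi{\rm i})^2}\int d\Lambda_x\,d\Lambda_y\left(\det\begin{pmatrix}\Lambda_x I_N & -{\rm i}D^{\rm T} \\ -{\rm i}D & \Lambda_y I_p\end{pmatrix}\right)^{-1/2}\exp\left[\frac{N\Lambda_x x}{2} + \frac{p\Lambda_y y}{2}\right],
\end{aligned} \tag{A.3}$$

where $I_N$ and $I_p$ are $N \times N$ and $p \times p$ identity matrices, respectively. Linear algebra 
can be used to generate the expression 

$$\begin{aligned}
\ln\det\begin{pmatrix}\Lambda_x I_N & -{\rm i}D^{\rm T} \\ -{\rm i}D & \Lambda_y I_p\end{pmatrix}
&= \sum_{k=1}^{\min(p,N)}\ln(\Lambda_x\Lambda_y + \lambda_k) + (N - \min(p,N))\ln\Lambda_x + (p - \min(p,N))\ln\Lambda_y \\
&\simeq N\left(\left\langle\ln(\Lambda_x\Lambda_y + \lambda)\right\rangle_\rho + (\alpha - 1)\ln\Lambda_y\right),
\end{aligned} \tag{A.4}$$

in the large system limit $N, p \to \infty$, keeping $\alpha = p/N \sim O(1)$. This implies that 
equation (A.3) can be evaluated by the saddle point method as 

$$\begin{aligned}
&\frac{1}{N}\ln\left[\int dw\,du\,\delta(|w|^2 - Nx)\,\delta(|u|^2 - py)\exp\left[{\rm i}u^{\rm T}Dw\right]\right] \\
&\quad= \mathop{\rm Extr}_{\Lambda_x,\Lambda_y}\left\{-\frac{1}{2}\left\langle\ln(\Lambda_x\Lambda_y + \lambda)\right\rangle_\rho - \frac{\alpha - 1}{2}\ln\Lambda_y + \frac{\Lambda_x x}{2} + \frac{\alpha\Lambda_y y}{2}\right\} + {\rm const},
\end{aligned} \tag{A.5}$$

where const represents constant terms that do not depend on either $x$ or $y$. In particular, 
setting $D = 0$ in this expression leads to 

$$\begin{aligned}
\frac{1}{N}\ln\left[\int dw\,du\,\delta(|w|^2 - Nx)\,\delta(|u|^2 - py)\right]
&= \mathop{\rm Extr}_{\Lambda_x,\Lambda_y}\left\{-\frac{1}{2}\ln\Lambda_x - \frac{\alpha}{2}\ln\Lambda_y + \frac{\Lambda_x x}{2} + \frac{\alpha\Lambda_y y}{2}\right\} + {\rm const} \\
&= \frac{1}{2}\ln x + \frac{\alpha}{2}\ln y + \frac{1 + \alpha}{2} + {\rm const}.
\end{aligned} \tag{A.6}$$

Equations (A.5) and (A.6) are used in equations (7) and (8). 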
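
The linear-algebra step behind equation (A.4) can be made explicit with a Schur complement. The following is a sketch under the assumption that $D$ is a real $p \times N$ matrix; the symbols $\mu_1, \ldots, \mu_p$, denoting the eigenvalues of $DD^{\rm T}$, are introduced here for illustration only.

```latex
\begin{align*}
\det\begin{pmatrix} \Lambda_x I_N & -{\rm i}D^{\rm T} \\ -{\rm i}D & \Lambda_y I_p \end{pmatrix}
&= \det(\Lambda_x I_N)\,
   \det\!\left(\Lambda_y I_p - (-{\rm i}D)(\Lambda_x I_N)^{-1}(-{\rm i}D^{\rm T})\right) \\
&= \Lambda_x^{N}\,\det\!\left(\Lambda_y I_p + \Lambda_x^{-1} D D^{\rm T}\right) \\
&= \Lambda_x^{N-p}\prod_{k=1}^{p}\left(\Lambda_x\Lambda_y + \mu_k\right).
\end{align*}
```

At most $\min(p,N)$ of the $\mu_k$ are nonzero, and these coincide with the nonzero eigenvalues $\lambda_k$ of $D^{\rm T}D$. Each of the remaining $p - \min(p,N)$ factors reduces to $\Lambda_x\Lambda_y$, so taking the logarithm gives $\sum_{k=1}^{\min(p,N)}\ln(\Lambda_x\Lambda_y + \lambda_k) + (N - \min(p,N))\ln\Lambda_x + (p - \min(p,N))\ln\Lambda_y$.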



Appendix B. Assessment of free energy under the 1RSB ansatz 

The argument in section 3 implies that when the $n \times n$ matrices $Q_w = (q_w^{ab})$ and $Q_u = (q_u^{ab})$ 
are simultaneously diagonalized by an identical orthogonal matrix, the average of the 
replicated coupling term with respect to $U$ and $V$ is evaluated as 

$$\frac{1}{N}\ln\left[\overline{\exp\left[{\rm i}\sum_{a=1}^{n}u_a^{\rm T}Xw_a\right]}\right] = \sum_{a=1}^{n}F(t_w^a, t_u^a), \tag{B.1}$$

where the overline denotes the average with respect to $U$ and $V$, and $t_w^a$ and $t_u^a$ 
$(a = 1, 2, \ldots, n)$ denote a pair of eigenvalues of $Q_w$ and $Q_u$ that 
correspond to an identical eigenvector. Under the 1RSB ansatz, the $n$ replica indices are 
divided into $n/m$ groups of identical size $m$, and the relevant saddle point is characterized 
as 

$$(q_w^{ab}, q_u^{ab}) = \begin{cases}
(\chi_w + v_w + q_w,\ \chi_u - v_u - q_u), & a = b, \\
(v_w + q_w,\ -v_u - q_u), & a \text{ and } b \text{ belong to an identical group}, \\
(q_w,\ -q_u), & \text{otherwise},
\end{cases} \tag{B.2}$$

where $m$ serves as Parisi's RSB parameter after analytical continuation. $Q_w$ and $Q_u$ 
of the form of equation (B.2) can be simultaneously diagonalized, which yields pairs of 
eigenvalues as 

$$(t_w^a, t_u^a) = \begin{cases}
(\chi_w + mv_w + nq_w,\ \chi_u - mv_u - nq_u), & 1, \\
(\chi_w + mv_w,\ \chi_u - mv_u), & n/m - 1, \\
(\chi_w,\ \chi_u), & n - n/m,
\end{cases} \tag{B.3}$$

where the numbers in the right-most column represent the degeneracies of the pair of 
eigenvalues denoted in the middle column. This gives 



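$$\begin{aligned}
\frac{1}{N}\ln\left[\overline{\exp\left[{\rm i}\sum_{a=1}^{n}u_a^{\rm T}Xw_a\right]}\right]
&= F(\chi_w + mv_w + nq_w,\ \chi_u - mv_u - nq_u) + \left(\frac{n}{m} - 1\right)F(\chi_w + mv_w,\ \chi_u - mv_u) \\
&\quad + \left(n - \frac{n}{m}\right)F(\chi_w, \chi_u).
\end{aligned} \tag{B.4}$$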
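
The eigenvalues and degeneracies quoted in equation (B.3) can be verified directly for small $n$ and $m$. The Python sketch below is an illustration, not part of the derivation: it builds a matrix with the block structure of equation (B.2) and checks one explicit eigenvector from each of the three families. The function names are hypothetical; for $Q_u$ the same check applies with $(\chi, v, q)$ replaced by $(\chi_u, -v_u, -q_u)$.

```python
def one_rsb_matrix(n, m, chi, v, q):
    """n x n matrix with chi + v + q on the diagonal, v + q inside each
    size-m diagonal block, and q elsewhere (the pattern of Q_w in (B.2))."""
    return [[q + (v if a // m == b // m else 0.0) + (chi if a == b else 0.0)
             for b in range(n)] for a in range(n)]


def matvec(mat, vec):
    """Plain matrix-vector product."""
    return [sum(x * y for x, y in zip(row, vec)) for row in mat]


def is_eigenpair(mat, vec, lam, tol=1e-9):
    """Check mat @ vec == lam * vec up to the tolerance tol."""
    return all(abs(y - lam * x) <= tol for x, y in zip(vec, matvec(mat, vec)))


def check_b3(n, m, chi, v, q):
    """Verify one eigenvector per eigenvalue family (requires m >= 2, n >= 2m)."""
    mat = one_rsb_matrix(n, m, chi, v, q)
    uniform = [1.0] * n                                # degeneracy 1
    between = [1.0] * m + [-1.0] * m + [0.0] * (n - 2 * m)  # degeneracy n/m - 1
    within = [1.0, -1.0] + [0.0] * (n - 2)             # degeneracy n - n/m
    return (is_eigenpair(mat, uniform, chi + m * v + n * q)
            and is_eigenpair(mat, between, chi + m * v)
            and is_eigenpair(mat, within, chi))


if __name__ == "__main__":
    print(check_b3(n=6, m=3, chi=0.7, v=0.2, q=0.1))
```

The degeneracies follow from counting: there are $n/m - 1$ independent group-contrast vectors and $(m - 1)\,n/m = n - n/m$ independent within-group contrasts, which together with the uniform vector span all $n$ dimensions.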

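Equation (B.4) and assessment of the volumes of dynamical variables $\{w_a\}$ and $\{u_a\}$ 
under the 1RSB ansatz (B.2), in conjunction with analytical continuation from $n \in \mathbb{N}$ 
to $n \in \mathbb{R}$, lead to the expression of the 1RSB free energy as 

$$\mathop{\rm Extr}_{\Theta^{\rm 1RSB}}\left\{\mathcal{A}_0^{\rm 1RSB}(\chi_w, \chi_u, v_w, v_u, q_w, q_u; m) + \mathcal{A}_w^{\rm 1RSB}(\chi_w, v_w, q_w; m) + \alpha\mathcal{A}_u^{\rm 1RSB}(\chi_u, v_u, q_u; m)\right\}, \tag{B.5}$$

where $\Theta^{\rm 1RSB} = (\chi_w, \chi_u, v_w, v_u, q_w, q_u)$, 

$$\begin{aligned}
\mathcal{A}_0^{\rm 1RSB}(\chi_w, \chi_u, v_w, v_u, q_w, q_u; m)
&= F(\chi_w, \chi_u) + \frac{1}{m}\left(F(\chi_w + mv_w,\ \chi_u - mv_u) - F(\chi_w, \chi_u)\right) \\
&\quad + q_w\frac{\partial F(\chi_w + mv_w,\ \chi_u - mv_u)}{\partial\chi_w} - q_u\frac{\partial F(\chi_w + mv_w,\ \chi_u - mv_u)}{\partial\chi_u},
\end{aligned} \tag{B.6}$$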



A w (Xun Vwi qun m ) 



^Extr 

Xw,v w 3 
Xw(Xw + v w + q w ) v w (Xw +m(v w + q w )) q w (Xw + mv w ) 



+— I Dzln 
V w 



(B.7) 



and 



A^ SB (Xu,v u ,q u ;m) 